Introduction



Programs designed to improve the health, nutritional, or cognitive status of preschool children promise 
to take young children at risk and potentially to change their lives. Children whose opportunities in life 
otherwise might be severely diminished are given a chance to have a brighter start and a brighter future. 
But intervention programs inevitably vary in quality and impact. Investment of huge amounts of time,
effort, and resources does not guarantee that such programs will produce the outcomes that are intended,
particularly with regard to cognitive development. For this reason, program developers or those who 
fund them recognize the need to evaluate the health, nutritional, and psychological impact of such pro- 
grams. In this review, we deal only with cognitive impact. 

The last 40 years of psychological assessment have witnessed an enormous increase in the number of 
psychological tests designed for the assessment of competencies in infants, toddlers, and preschool chil- 
dren. It is fair to say that, today, early childhood assessment constitutes a growing field with new 
instruments being developed regularly. The purpose of this review is to summarize the quantitative and 
qualitative characteristics of psychological tests and other assessment instruments used to evaluate the 
cognitive functioning of infants, toddlers, and preschool children. 

The review is divided into three parts. The first part summarizes general principles of early childhood 
assessment. The second part describes the major domains in which the various assessment tools can be 
compared, evaluated, and selected. Finally, the third part presents brief descriptions and evaluations of 
selected instruments.

I. General Principles of Early Childhood Assessment 

Early childhood assessment is guided by five major principles. 

First, no single test can address all questions or solve all problems. Thus, assessment of preschoolers 
ideally should rely on specific instruments for specific situations. Moreover, multiple instruments ought 
to be used, if possible, because no one instrument is likely to provide a complete assessment of all 
intended outcomes. 

Second, there is a greater likelihood of substantial levels of measurement error in early childhood 
assessment than in assessment at any other period of a child’s life. Therefore, accurate early childhood
assessment ideally requires that information from a highly structured assessment be integrated with
information from less structured, more open-ended assessments, such as interviews and
behavioral observations. The combination of more and less structured assessments raises the probability 
that a complete picture of the data will emerge. 

Third, young children usually perform better in the company of familiar adults. Thus, assessment of 
young children should be viewed as a collaborative enterprise between the assessor and the child’s
parents or caregivers. The more standard model of assessment in a sterile environment (such as a bare 
room with no other people besides the tester and test-taker) does not apply as well to young children as 
it does to older ones. 

Fourth, young children’s functioning is profoundly influenced by the setting in which the assessment 
takes place. The implication for assessment is that it is best to observe the child in a range of natural 
settings, most importantly, the home and the childcare situation. Just as some adults function very
differently in the home versus the workplace, many young children, even more so, function differently in
one setting versus another.

Fifth and finally, assessment should extend over a period of time (for developmental and general/ 
mental-health tasks, 4 to 8 weeks is ideal) in order for assessors to gain a detailed understanding of the 
prevailing emotional themes, the range of functioning in the child and the caregivers, the degree of 
variation in the quality of the child’s primary relationships, and the relative influence of situational fac- 
tors, such as family circumstances and chronic stressors. 

To realize these principles of early childhood assessment, skillful evaluators (1) form alliances with 
caregivers, (2) utilize structured and semi-structured interviewing techniques, (3) ask questions to
clarify but not disrupt the caregivers’ accounts, (4) listen to participants and observe the affect they 
demonstrate as well as the content they provide, and (5) guide both children and caregivers through the 
assessment process. 

Knoff (1999) emphasized that assessment should be multi-method, multi-source, and multi-setting. 
Among the many techniques used in early childhood assessment, the three most prevalent ones are test- 
ing, interviewing, and behavioral observations. 

Testing

Testing is an assessment technique that employs standardized instruments. For an assessment instru- 
ment to be called a test, it should (a) have a clear purpose (e.g., target a certain psychological domain 
or multiple domains), (b) provide explicit methods to evaluate data (e.g., specify correct and incorrect 
responses), (c) rely on a standardization scheme (e.g., link individual data to population data), (d) meet 
certain psychometric criteria, and (e) have appropriate test format, construction, and administration. 

Interviewing 

Because much of the information about young children’s daily functioning is best delivered by care- 
givers, interviews with primary caregivers are often central to a comprehensive developmental assess- 
ment. The main purpose of interviews with caregivers is to gather information about the child’s 
developmental history and the caregivers’ perceptions of the child’s level of functioning. The important 
areas to cover are (1) the history of the mother’s pregnancy, delivery, and immediate perinatal period; 
(2) the child’s medical history; (3) the child’s developmental milestones; (4) the number, ages, and 
health of family members; (5) the infant’s fit in the family’s daily life; (6) each parent’s interpretation 
of the significance of the child to their lives and the life of the family; and (7) the child’s functioning in
several areas. 

The following aspects of the child’s development and well-being should be assessed directly and 
evaluated in parental interviews: (a) motor development; (b) general activity level; (c) speech and 
communication; (d) problem solving; (e) play; (f) self-regulation (e.g., independence, initiative, need 
for routines); and (g) relationships with others, including level of social responsiveness. 

Behavioral Observations

Assessment of young children should include general descriptions of the children’s behavior and qualita- 
tive accounts of the children’s behavior in at least one structured setting. Special areas of interest are the 
children’s (a) responses to developmental tasks (excitement, positive versus negative affect, energy versus 
lack of energy, quickness versus slowness, and deliberation versus impulsiveness); (b) ability to cope with 
frustration; (c) engagement with the adult world; (d) range of emotional expressiveness; and (e) capacity 
for persistence and sustained attention. Behavioral observations require a mixture of free-floating attention 
to the child’s behavior and the more focused attention to specific situational responses that is inherent in 
any structured or semi-structured assessment. In other words, the assessor should be attentive and sensi- 
tive to whatever occurs, but she or he should also have a mental plan, a mental map that serves as a 
framework for organizing observations collected during the assessment. Key components to be addressed 
in such a framework are (a) the quality of the evaluative environment or “evaluative atmosphere” and the 
affective attitudes of the caregiver and the child; (b) situational involvement (curiosity and interest versus 
detachment and lack of interest), (c) engagement of others (the child’s interactions with the caregiver and 
the examiner, the caregiver’s involvement with the child), and (d) reaction to change (initial greeting, end- 
ing of assessment, transitional periods from task to task, and so on). 

There are at minimum three levels at which the assessor needs to collect her or his information. The 
first and most apparent is the level at which the child’s reactions and responses to the structured assess- 
ment items are registered. The data collected at this level should not be solely confined to whether or 
not the child passes or fails a given task but also should address qualitatively how the child approaches 
and deals with the task. The second level of observation concerns the child’s reaction to the assessment 
situation, apart from the formal tasks. The assessor must register whether the child approaches toys, ini- 
tiates interactions, refers to the examiner or to his or her caregiver, reacts to the assessor and the situa- 
tion at the beginning and the end of the evaluation session, and so on. The third level of observation 
addresses the interaction between caregiver and infant. Observations at this level are made throughout 
the assessment process, and various fluctuations in these interactions are registered. 

Here are some points of observation for judging caregiver-child interactions:

Does the child appeal to the caregiver for help and reassurance? 

Does the child call his or her success to the attention of the caregiver? 

Does the caregiver respond to the child’s success/failure? 

How does the caregiver hold and soothe her or his child? 

Does the child distance himself or herself from the caregiver to work with tasks or to explore?

Does the caregiver show his or her involvement? Is the involvement intrusive or encouraging? 
Does the caregiver comfortably assist the child? 

Is the caregiver withdrawn? 

Synthesis 

Synthesis in the assessment of young children involves summarizing both qualitative and
quantitative data gathered from interviews, observations, and testing. Whether assessment has been carried
out for clinical or research purposes, it is important to keep in mind that there is no single instrument or
technique that can capture the full variability of the child’s performance in a variety of settings. In working
with young children, the multitrait-multimethod approach is therefore a must.

II. Factors Important for Evaluating and Selecting Tests

As has been stressed above, the main rule of early childhood assessment is to use a variety of tests. 
Assessors working with young children should not fetter themselves by knowing about only a few tests. 
As the old adage goes, “If all you have is a hammer, all your problems start looking like nails.” Table 1 
lists major domains of young children’s behavior and suggests tests that can be used for each specific 
domain. 

Five distinct issues should be considered when evaluating a developmental test: (1) the extent to which 
the purpose of the test fits the purpose of the assessment, (2) the source of the test data, (3) the quality 
of the test standardization and the relevance of the population on which it was standardized to the popu- 
lation being tested, (4) the psychometric properties of the test, and (5) the qualitative characteristics of 
the test (i.e., test construction, format, and administration). Criteria for evaluating tests for young chil- 
dren are summarized in Table 2. 

Purpose 

Early childhood tests are traditionally used for various purposes, among which are diagnosis, screening, 
intervention planning, and research. Typically, there are different tests available for these different pur- 
poses, but sometimes the same test can be used for different purposes. 

In general, tests administered for diagnostic purposes are administered individually, with the goal of 
obtaining a comprehensive picture of the child’s functioning in a number of areas. Screening tests are 
used primarily when many children are to be assessed and when evaluators desire a relatively brief, 
cheap, and user-friendly instrument that will allow them to identify children, especially infants, who 
may be “at-risk” for developmental delay. Such tests also may be used to evaluate the impact of an 
intervention program. Testing (both individual and group) also can be aimed at identifying objectives 
and steps for individualized intervention programs. Moreover, individual/group testing can be used to 
track children’s achievement of desired goals over time and to monitor the effects of intervention pro- 
grams. Finally, testing carried out for research purposes may be also designed for individual or group 
administration. 

Data 

The main purpose of administering formal tests to children is to collect and organize data regarding 
children’s level of functioning. In working with young children, the most salient types of data are those 
obtained from direct assessment, observation, and caregiver interviews (reports). Most developmental 
tests utilize some combination of all three of these types of data; there are, however, specialized instru- 
ments capitalizing on a single type of data. Each of the three types of data has its strengths and weak- 
nesses. For example, direct assessment typically utilizes standardized types of administration, which 
readily allow a particular child’s performance to be compared to the performance of other children. It is 
crucial that the standardization data be relevant to the children being assessed. Yet, standardized test 
items typically open only a small window into a child’s life, and a child’s performance can be greatly 
influenced by the child’s motivation, mood, comfort, and rapport with the assessor. It is therefore 
important not to draw conclusions from limited test data that go beyond what the data truly can say. 
Observational data, too, have their advantages and limitations. When collected in naturalistic settings,
observations can counter the “artificiality” of direct assessment, providing, presumably, more ecologically
valid data. However, in naturalistic observation, the relative contributions of the child versus the
context cannot readily be partitioned. Specifically, based on observational data, it is sometimes unclear
whether the child’s observed behavior is due to a problem within the child or to a problem in the
environment in which the child was observed. Similarly, caregivers’ reports may be useful for screening
purposes, for detecting rare problem behaviors, for obtaining information regarding behavior that may
be difficult to elicit in the structured assessment, and for assessing the caregivers’ perspectives of the
children. But such reports bear a stamp of subjectivity and, therefore, should be viewed cautiously
(Meisels & Wasik, 1990).

Table 1. Major Domains of Young Children’s Behavior and Tests Utilizable for Assessments of
These Domains

Targeted Domains                Testing

General Cognitive Abilities     Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R)
                                Differential Ability Scales-Preschool Core (DAS)
                                Stanford-Binet Intelligence Scale: Fourth Edition (SBIS:IV)
                                McCarthy Scales of Children’s Abilities
                                Kaufman Assessment Battery for Children (K-ABC)
                                Woodcock-Johnson Psychoeducational Test Battery-Revised (WJ-R)
                                Mullen Scales of Early Learning (MSEL)

Language and Language-Related   Peabody Picture Vocabulary Test-Third Edition (PPVT-III)
Processes                       Verbal subtests of standardized intellectual batteries
                                Preschool Language Scale-Third Edition (PLS-3)
                                Test of Language Development-Second Edition (TOLD-2)
                                Vineland Adaptive Behavior Scale-Communication Domain
                                NEPSY (Language Domain)
                                MSEL

Nonverbal Processing            K-ABC Simultaneous Scale
                                Nonverbal subtests of DAS, WPPSI-R, McCarthy, SBIS:IV

Motor                           Purdue Pegboard
                                Vineland Motor Domain
                                McCarthy Motor Scale
                                NEPSY (Fingertip Tapping, Imitating Hand Positions, Manual Motor
                                Sequences, Finger Discrimination)
                                MSEL

Executive Functions             WPPSI-R Animal Pegs
                                NEPSY (Tower, Statue, Knock and Tap)

Memory                          DAS (Recall of Digits, Recall of Objects)
                                McCarthy Memory Scale
                                SBIS:IV (Bead Memory, Sentence Memory)
                                K-ABC (Face Recognition, Hand Movements, Number Recall, Word Order,
                                Spatial Memory)
                                NEPSY (Memory for Faces, Memory for Names, Narrative Memory, Sentence
                                Repetition, List Learning)

Social/Emotional Adjustment     Vineland Socialization Domain
                                The Child Behavior Checklist
                                Conners’ Behavior Rating Scales

Preacademic Skills              K-ABC (Achievement Scale)
                                WJ-R
                                WPPSI-R (Arithmetic, Information)
                                MSEL
                                DAS (Matching Letter Like Forms, Early Number Concept)

Table 2. Criteria for Evaluating Technical Characteristics of Early Childhood Cognitive
Competence Assessment Devices*

Criterion           Specifications and Evaluations

Purpose             1. Diagnostic
                    2. Screening
                    3. Intervention planning
                    4. Research

Data                1. Direct assessment
                    2. Observation
                    3. Caregiver (parent/teacher) interview/report

Standardizability   1. Standardization
                       (1) Size of normative group
                           Good        N = 200 per each 1-year interval and N > 2,000 overall
                           Adequate    N = 100 per each 1-year interval and N > 1,000 overall
                           Inadequate  Neither requirement above is met
                       (2) Satisfactoriness of normative data
                           Good        Collected in 1988 or later
                           Adequate    Collected between 1978 and 1987
                           Inadequate  Collected in 1977 or earlier
                       (3) Representativeness of the general population by the normative sample
                           Good        Normative sample represents the general population on
                                       > 5 important demographic variables (e.g., gender,
                                       nationality) with SES included
                           Adequate    Normative sample represents the general population on
                                       > 3 important demographic variables with SES included
                           Inadequate  Neither criterion is met
                    2. Norm table age stratification
                           Good        1-2 months
                           Adequate    3-4 months
                           Inadequate  > 4 months

Psychometric        1. Reliability
Properties             (1) Internal consistency reliability coefficient
                           Good        ≥ .90
                           Adequate    .80-.89
                           Inadequate  < .80
                       (2) Test-retest reliability coefficient
                           Good        ≥ .90
                           Adequate    .80-.89
                           Inadequate  < .80
                       (3) Sample size and representativeness of test-retest sample
                           Good        N > 100 and represents the general population on
                                       > 5 important demographic variables
                           Adequate    N > 50 and represents the general population on
                                       > 3 important demographic variables
                           Inadequate  Neither criterion is met
                       (4) Age range of the test-retest sample
                           Good        < 1-year interval
                           Adequate    < 2-year interval
                           Inadequate  > 2-year interval (or extends beyond the preschool age
                                       range, i.e., 3 to 5 years)
                       (5) Length of test-retest interval
                           Good        < 3 months
                           Adequate    > 3, but < 6 months
                           Inadequate  > 6 months
                    2. Validity (content, criterion-related, and construct)
                           Good        3 types of validity evaluated
                           Adequate    2 types of validity evaluated
                           Inadequate  1 type of validity evaluated
                    3. Floors
                           Adequate    Raw score of 1 is associated with a standard score
                                       > 2 sd(s) below the normative mean
                           Inadequate  Raw score of 1 is associated with a standard score
                                       < 2 sd(s) below the normative mean
                    4. Item Gradients
                           Good        No item gradient violations occur or all item gradient
                                       violations are between 2 and 3 sd(s) below the
                                       normative mean
                           Adequate    All item gradient violations occur between 1 and 3 sd(s)
                                       below the normative mean
                           Inadequate  All or any portion of item gradient violations occur
                                       between the mean and 1 sd below the normative mean

* Adapted from Alfonso & Flanagan, 1999.

Standardization

The standardization of data refers to the availability of normative data regarding children’s typical
performance on a given test. Normative data allow an experienced assessor to determine a specific child’s
placement relative to the normative group. This relative placement is usually expressed in terms of
percentile ranks or standard scores.

The critical issue here is to determine whether a given test’s standardization sample is representative of
the population of children the test user plans to assess. The norms should be appropriate for the
children’s specific historical context, locality, gender, ethnicity, and economic status. For example, U.S.
norms applied to African data will be of little use. Moreover, it is important to take into account that the
same item presents different challenges to children in different populations, so that the standardization
data may not be meaningful when transferred from one population to another. For example, difficult
vocabulary may be less relevant in the lives of some children than in the lives of others. Some children
may never encounter these words in their lives, whereas other children may encounter them with some
frequency.

Psychometric Properties

The psychometric properties of psychological instruments assessing characteristics of early development
are crucial because (1) young children’s abilities change rapidly and instruments have to be sensitive
to these subtle changes; (2) many traditional instruments represent downward extensions of tests
that originally were designed for older children and, therefore, were not designed with the preschool
child in mind; and (3) the use of traditional intelligence tests with young children has been criticized
extensively due to poor psychometric properties of the tests (Alfonso & Flanagan, 1999).

The two key psychometric properties of a test are the test’s reliability and validity. In general, reliability
indicates the test’s dependability (i.e., the test’s ability to produce similar results under differing
conditions). Specifically, (1) test-retest reliability indicates the test’s temporal stability (as shown by the
correlation between test scores obtained at one time with test scores of the same individuals obtained at a
later time); (2) inter-rater reliability indicates the degree to which test scores are insensitive to
individual differences between different assessors; and (3) internal consistency indicates the degree to which
different items of the test measure the same underlying construct (Salvia & Ysseldyke, 1991). Internal
consistency can be measured in a number of ways. Odd-even reliability refers to the correlation
between scores on the odd and even numbered items of a test (corrected by the Spearman-Brown
formula). Coefficient alpha corresponds to the average of the corrected correlations across all possible
split-halves of the test.

Validity refers to the extent to which a test measures what it is supposed to measure. There are several
different types of validity that are useful in determining the overall quality of a test. Tests are expected
to demonstrate face, content, criterion-related, and construct validity. Face validity refers to the extent
to which the test appears to examinees to measure what it is supposed to measure. Face validity is
important because examinees sometimes lack motivation to succeed on tests that do not appear to be
face valid to the examinees. Content validity refers to the extent to which the test appears to experts to 
measure what it is supposed to measure. Content validity is important because, in its absence, experts 
may not accept the results of the test, regardless of its statistical properties. Criterion-related validity 
refers to the extent to which a test is correlated with other measures with which it is supposed to be cor- 
related. Criterion-related validity can be either predictive or concurrent. Predictive validity refers to the 
extent to which the test predicts criterion performance to be obtained in the relatively distant future. 
Concurrent validity refers to the extent to which the test predicts criterion performance collected con- 
currently or in the very near future (such as the same day or the next day). Psychometricians also some- 
times distinguish between convergent validity and discriminant validity. Convergent validity refers to 
the extent to which a test or test battery correlates with other measures with which it is supposed to cor- 
relate. Discriminant validity refers to the extent to which a test fails to correlate with other measures 
with which it is not supposed to correlate. For example, one would wish a test of intellectual ability to 
predict school performance but not to predict physical prowess. 
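
As an illustration of the distinction between convergent and discriminant validity, the following minimal
sketch correlates scores on an ability test with one measure it is supposed to predict and one it is not.
All scores are invented for illustration, and the pearson helper and variable names are assumptions of this
sketch rather than part of any published scoring procedure.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


# Hypothetical scores for six children (illustrative values only).
ability_test = [85, 92, 100, 104, 110, 121]
school_performance = [70, 74, 81, 80, 88, 95]  # supposed to correlate with ability
grip_strength = [12, 9, 14, 10, 13, 11]        # not supposed to correlate

# Convergent evidence: a high correlation with the related measure.
convergent_r = pearson(ability_test, school_performance)
# Discriminant evidence: a near-zero correlation with the unrelated measure.
discriminant_r = pearson(ability_test, grip_strength)

print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")
```

With these invented data the first correlation is high and the second is close to zero, which is the pattern
one would hope to see for a test of intellectual ability.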

A number of general standards for assessing the adequacy of developmental assessment instruments for 
preschoolers have been proposed by Bracken (1987; see also Flanagan & Alfonso, 1995). 

First, assessment tools should possess adequate reliability, as evidenced by a median subtest internal-consistency
reliability of at least .80 and total test internal-consistency and stability reliability coefficients
of at least .90.
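
The internal-consistency coefficients referred to in this standard can be computed directly from item-level
data. The sketch below is illustrative only: the response matrix is invented, the function names are our
own, and population variances are assumed in the coefficient alpha formula. It computes odd-even
reliability with the Spearman-Brown correction and coefficient alpha, and then compares alpha with the .90
standard for total test scores.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


def odd_even_reliability(item_scores):
    """Correlate odd-item and even-item totals, then apply Spearman-Brown."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd_totals, even_totals)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown correction to full length


def coefficient_alpha(item_scores):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(item_scores[0])
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical pass/fail (1/0) responses: one row per child, one column per item.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

odd_even = odd_even_reliability(responses)   # about .88 for these data
alpha = coefficient_alpha(responses)         # about .80 for these data
meets_total_standard = alpha >= 0.90         # standard for total test scores

print(f"odd-even = {odd_even:.2f}, alpha = {alpha:.2f}, meets .90: {meets_total_standard}")
```

A four-item example is far too short to be realistic; it is used here only so that the arithmetic can be
checked by hand.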

Second, developmental tests should have adequate “floor space.” Specifically, tests should have enough 
low-level items so as to allow children to score at least two standard deviations below the mean on the 
overall score as well as on all subtest scores. Adequate test floors are important for distinguishing children
performing at different levels of functioning (average, low average, borderline, retarded). When
test floors are inadequate, they (a) provide scores that tend to overestimate the cognitive functioning of
individuals with various degrees of retardation (i.e., mild, moderate, severe) and (b) provide more infor- 
mation about what a child cannot do than about what a child can do. (We would add that certain tests 
also should have adequate “ceiling space,” with enough difficult items so as to allow children to score 
at least two standard deviations above the mean on the overall score as well as all subtest scores. In 
particular, tests of minimum competencies do not need ceiling space, but tests of the full range of com- 
petencies should allow such space.) 
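
The floor and ceiling checks described above reduce to a simple comparison against the normative mean. The
following sketch assumes a standard-score metric with a mean of 100 and a standard deviation of 15 by
default; these defaults, the function names, and the example scores are assumptions made for illustration
and would need to be replaced with the values from a given test's norm tables.

```python
def has_adequate_floor(standard_score_for_raw_1, norm_mean=100.0, norm_sd=15.0):
    """True when a raw score of 1 falls at least 2 SDs below the normative mean."""
    return standard_score_for_raw_1 <= norm_mean - 2 * norm_sd


def has_adequate_ceiling(standard_score_for_max_raw, norm_mean=100.0, norm_sd=15.0):
    """True when the maximum raw score falls at least 2 SDs above the normative mean."""
    return standard_score_for_max_raw >= norm_mean + 2 * norm_sd


# Hypothetical norm-table entries for one age band.
print(has_adequate_floor(65))           # True: 65 is below the cutoff of 70
print(has_adequate_floor(78))           # False: a raw score of 1 still earns 78
print(has_adequate_floor(4, 10, 3))     # True: subtest metric with mean 10, SD 3
print(has_adequate_ceiling(135))        # True: 135 is above the cutoff of 130
```

The check should be repeated for the overall score and for every subtest, and for each age band in the
norm tables, because floors often differ across ages.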

Third, the subtest item gradient should not be too steep. Specifically, each standard deviation of scores 
in the children’s performance should consist of at least three raw score items. Put another way, large 
differences in standardized scores should not stem from small differences in actual raw-test scores. A 
subtest item gradient, by referring to the amount of change in a child’s standard score that is associated 
with a one-unit change in his/her raw score, indicates the sensitivity of the test (i.e., the test’s potential 
to detect fine gradations in cognitive performance within and across competency levels). If the subtest 
gradient is too steep, then strong conclusions may be made on the basis of raw score differences that 
reflect little more than chance fluctuations in the data. Inadequate test floors and item-gradient viola- 
tions seriously jeopardize the quality of testing instruments, especially when the tests are used with 
young children. 
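
The item-gradient rule can be checked mechanically against a norm table. If each standard deviation must
span at least three raw-score points, then a one-point change in raw score should not move the standard
score by more than one third of a standard deviation. The sketch below applies that reading of the rule to
an invented norm table; the function name, the table values, and the default standard deviation of 15 are
assumptions made for illustration.

```python
def item_gradient_violations(norm_table, norm_sd=15.0):
    """Return (lower_raw, upper_raw, change_in_sd_units) for each step that is too steep.

    norm_table maps raw scores to standard scores for a single age band.
    A step is too steep when the standard-score change per raw-score point
    exceeds one third of a standard deviation.
    """
    max_change_per_point = norm_sd / 3
    raw_scores = sorted(norm_table)
    violations = []
    for lower, upper in zip(raw_scores, raw_scores[1:]):
        change_per_point = (norm_table[upper] - norm_table[lower]) / (upper - lower)
        if change_per_point > max_change_per_point:
            violations.append((lower, upper, change_per_point / norm_sd))
    return violations


# Hypothetical norm table for one subtest and one age band (raw score -> standard score).
norm_table = {0: 55, 1: 62, 2: 66, 3: 70, 4: 75, 5: 80}

violations = item_gradient_violations(norm_table)
for lower, upper, sd_units in violations:
    print(f"raw {lower} -> {upper}: standard score changes by {sd_units:.2f} SD")
```

In this invented table only the step from a raw score of 0 to a raw score of 1 is too steep: it moves the
standard score by 7 points, which is more than the 5 points (one third of 15) that the rule allows.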

Test Construction, Format, and Administration

The qualitative characteristics of tests for young children (test construction, format, and administration) 
include such characteristics as the appropriateness and attractiveness of testing material for young chil- 
dren, the opportunities for teaching to the tasks being tested, the comprehensiveness of the instructions, 
the appropriateness of the test for multicultural populations, and the testing environment. 

In interpreting test results of children from diverse cultural backgrounds, Newland (1971) suggested 
placing various tests on the product-dominant/process-dominant continuum, where product-dominant 
tests depend on accumulated knowledge and environmental experiences to a greater extent than do 
process-dominant tests, which are assumed to assess fundamental learning and thinking processes. 

III. Specific Early Childhood Tests



In order to balance breadth with depth, the following section provides a more extensive summary of 
some tests than of others. When an extensive summary is provided, the presentation of the test (a) pro- 
vides a description of the test, (b) discusses the theory underlying the test, (c) summarizes qualitative 
and quantitative characteristics of the test, (d) addresses the purposes of the test and its reputation in the 
field, (e) comments on the test’s strengths and weaknesses, (f) remarks on North American cultural 
biases that may be embedded in the test, and (g) comments on applications of the test in diverse coun- 
tries. 

Most of the existing infant/toddler/preschooler tests can be clustered into one of three categories, each 
based on a different theoretical and/or psychometric model (Gilliam & Mayes, in press). These groups 
are (1) multidomain development tests, (2) theories-of-cognition-based tests, and (3) criterion (norm)- 
referenced assessments. 

Multidomain Assessment 

The multidomain model is arguably both the most widely used and the oldest model of infant-toddler- 
preschooler assessment. The theoretical basis of the model is that child development is an interactively 
unfolding, continuous process that occurs in several distinct but interrelated domains (Gesell, 1940). 
Traditionally, these domains include (1) motor (fine and gross motor skills), (2) communication (receptive
and expressive language), (3) cognition (problem-solving skills), (4) adaptive competence (self-help
behaviors such as dressing, eating, and toileting), and (5) personal-social competence (social
competence, emotional regulation, and sense of self). These domains are covered comprehensively in
some modern assessment devices [e.g., the Bayley Scales of Infant Development (Bayley, 1993); the
Battelle Developmental Inventory (Newborg, Stock, Wnek, et al., 1984); the Griffiths Mental Development
Scales (Griffiths, 1954, 1979); the Mullen Scales of Early Learning (Mullen, 1995); and the
NEPSY (Korkman, Kirk, & Kemp, 1998)] and partially in others [e.g., the Receptive-Expressive Emergent
Language Scale-2 (Bzoch & League, 1991); the Peabody Developmental Motor Scale (Folio &
Fewell, 1983); and the Vineland Social-Emotional Early Childhood Scales (Sparrow, Balla, & Cicchetti,
1998)].

Standardized Tests 

The Brazelton Neonatal Behavioral Assessment Scale (2nd ed.). The Brazelton Neonatal Behavioral
Assessment Scale, 2nd ed. (NBAS-2; Brazelton, 1984), is a popular test of the neonate’s current organizational
and coping capacities in response to the stress of labor, delivery, and adjustment to the extra-utero
environment. The NBAS-2 is designed for use with neonates 37 to 44 weeks of gestational age who
do not require mechanical supports or oxygen. The test takes 20-30 minutes to administer (and about 15 
minutes to record and score the infant’s performance). Multiple assessments are recommended, with the 
first taking place at no earlier than three days of life. The NBAS data permit the construction of com- 
posite (factor and summative) scores, but the procedure is complicated and time-consuming. The NBAS 
inter-rater reliability estimates are quite high, but the indicators of the test-retest reliability are rather 
low, suggesting poor temporal stability for most items (Sameroff, 1978). The NBAS was originally 
designed for use with full-term healthy infants, but it has been used most extensively with premature 
and otherwise medically at-risk infants. The validity of the NBAS is supported by research that has
demonstrated its ability to discriminate groups of underweight, intra-utero drug or alcohol-exposed,
gestationally diabetic, or intra-utero malnourished neonates as compared with the normative sample.
The NBAS has been shown to predict infant-parent attachment and subsequent infant development, but
primarily within the first year of life (Horowitz & Linn, 1982).

The Bayley Scales of Infant Development-II. The Bayley Scales of Infant Development-II (BSID-II; Bayley,
1993) represent the first restandardization of the Bayley test in 25 years. This scale is arguably the
most widely used measure of the development of infants and toddlers. In addition, the BSID has an 
extensive psychometric history and a very respectable track record. The BSID-II is applicable to chil- 
dren from 1 through 42 months of age. The administration takes about 25 to 35 minutes for infants 
under 15 months of age and up to 60 minutes for children over 15 months. 

The major difference between the older and the revised versions of the BSID is that the BSID-II is 
administered in “item sets,” which are sets of items selected based on the age of the infant, whereas the 
original BSID used a continuous series of items. This modification created some confusion among 
infant assessors. For example, it is unclear which “item set” to use for infants born prematurely (Ross
& Lawson, 1997a) or for infants living in cultural settings that differ from those of the infants in the
normative sample (Gilliam, in press; Gilliam & Mayes, in press). For testers who adopt the corrected
age procedure (for example, a baby who was born two months prematurely and is 9 months old should
be administered the set of items for 7-month-olds), the test developers recommend using the same
“item set” that corresponds to the normative group used for determining that child’s score. Specifically, 
if an infant’s performance is to be compared to that of a typical 7-month-old, the examiner should 
administer the 7-month-old item set (Matula, Gyurke, & Aylward, 1997). 
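
The corrected-age arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the subtraction involved: the function names are hypothetical and are not part of the BSID-II materials, and flooring the result at zero is an assumption made for the sketch.

```python
def corrected_age_months(chronological_months: float, months_premature: float) -> float:
    """Subtract the months of prematurity from the chronological age.

    The result is floored at zero (an assumption made for this sketch).
    """
    if chronological_months < 0 or months_premature < 0:
        raise ValueError("ages must be non-negative")
    return max(chronological_months - months_premature, 0)


def item_set_age(chronological_months: float, months_premature: float,
                 use_corrected_age: bool = True) -> float:
    """Return the age (in months) whose item set and norms are used together."""
    if use_corrected_age:
        return corrected_age_months(chronological_months, months_premature)
    return chronological_months


# Example from the text: born two months early, now 9 months old.
print(item_set_age(9, 2))  # 7
```

The point of returning a single age is that the item set administered and the normative group used for scoring stay matched, as the test developers recommend.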

The BSID-II consists of three components: the Mental Development Index (MDI), the Psychomotor 
Development Index (PDI), and the Behavior Rating Scale (BRS). The MDI assesses the child’s lan- 
guage development and problem-solving (cognitive) skills; the PDI assesses the child’s gross and fine 
motor development; the BRS provides information on the child’s behaviors during the assessment. The
BSID-II permits obtaining age equivalence scores for four facets of development: Cognitive, Language, 
Social, and Motor. 

The BSID-II was normed on 1700 infants representative of 1988 U.S. Census data. Test-retest reliabili- 
ties for time periods of 1 to 16 days range from .83 to .91 for the MDI and from .77 to .79 for the PDI. 
Stability for the BRS varies greatly depending on the age of the child, ranging from .55 to .90. Inter- 
rater reliability indicators for the BSID-II were reported to be .96 for the MDI, .75 for the PDI, and .70 
for the total BRS. The total test internal-consistency reliability coefficients of the BSID-II are adequate, 
ranging from .89 (at ages of 2 1/2 years and 3 years) to .90 (at an age of 3 1/2 years). Concurrent
validity of the MDI, as compared to other measures of general cognitive ability, typically falls in the 
.70 range, whereas the highest correlation between the PDI and other indicators of cognitive ability was 
.59. The norm tables for the BSID-II are adequate and are divided into one-month age blocks for chil- 
dren aged 36 to 42 months. 

The Mental scale of the BSID-II has an adequate floor and good item gradients. This scale yields stan- 
dard scores greater than 2 standard deviations below the mean for children between the ages of 2 1/2 
years and 3 1/2 years. In addition, the scale has a number of items below the entry level for a child 
aged 2 1/2 years, providing adequate floors. The validity data on the BSID-II are still being accumulated,
with the first data suggesting that validity of the instrument is adequate (e.g., Gerken, Eliason, &
Arthur, 1994). 

The qualitative characteristics of the BSID-II range from adequate to good. The manual contains infor- 
mation on the BSID-II theoretical framework, administration, scoring, and score interpretation. The 
instrument utilizes attractive and stimulating material and successfully alternates item types (e.g., lan- 
guage, visual, visual-motor). The administration time is about 60 minutes. The expressive language 
demands are minimal, but gestures are not acceptable responses. The test utilizes elements of dynamic 
testing, allowing for multiple trials and demonstrations. The BSID-II allows some flexibility in admin- 
istration and includes rules for when to discontinue testing. The instructions are short and concise, even 
though the basic concept load (i.e., the number of basic concepts that are expected to be mastered by a 
child at a given age) is fairly high (28; Alfonso & Flanagan, 1999). The BSID-II has not yet been trans- 
lated into languages other than English and at this time there are not yet norms from different cultures. 
Given that the BSID-II utilizes many objects from the traditional toy world of a North-American child, 
the amount of Western acculturation required to perform well on the test is relatively high. 

The history of research with the BSID is replete with empirical demonstrations of both the usefulness 
and the futility of the data collected. On the one hand, the BSID has proven to be useful for the assess- 
ment of the current status of the infant (Lipsitt, 1992). On the other hand, the testing of children 
younger than 18 months of age with the BSID has yielded little predictive validity, at least to the extent 
that one is interested in anticipating the later intellectual or cognitive development of a given child 
(Colombo, 1993). As a matter of fact, Dr. Bayley herself expressed reservations about the use of the test 
for predictive purposes, suggesting that researchers examine the mother rather than the child. 
Researchers have arrived at the conclusion that, for children younger than 18 months, the BSID does 
not yield consistent results. In addition to lacking predictive power in the domain of intelligence, the 
Bayley does not predict either the child’s behavioral scores or the child’s psychiatric diagnoses (Dietz, 
Lavigne, Arend, & Rosenbaum, 1997). Burns et al. (1992) have made the case from their data that
3-month-olds are more like other 3-month-olds across a variety of tasks than they are like themselves
over a long period of time. In other words, at 3 months of age, a normally-developing child and a men- 
tally-retarded child appear to be very similar in their capacities as assessed by the BSID. When they 
reach the age of 3, however, they are very different. To sum up, the BSID appears to have some ability 
to predict which infants will score very poorly on intelligence tests before the age of 3, but shows a 
limited ability to accurately predict specific IQ scores, especially in average developing infants (Gibbs, 
1990; Whatley, 1987). 

The NEPSY. The NEPSY (NE from neuro and PSY from psychology; Korkman, Kirk, & Kemp, 1998) is
a comprehensive battery designed to assess neuropsychological development in preschool and school- 
age children. The battery was designed to assess basic and complex aspects of cognitive development 
critical to children’s ability to learn and be productive both inside and outside of school settings.

The NEPSY includes a set of neuropsychological subtests that can be used in various combinations.
The assessment is carried out in five domains: Attention/Executive Functions, Language, Sensorimotor
Functions, Visuospatial Processing, and Memory and Learning. The NEPSY’s subtests are divided into
core subtests and expanded subtests.

The theoretical background of the NEPSY is that of the Luria tradition of assessment (Christensen,
1984; Luria, 1973). According to this tradition, cognitive functions, such as attention and executive 
functions, language, movements, visuospatial abilities, and learning and memory, are complex capaci- 
ties. They are composed of flexible and interactive subcomponents that are mediated by equally flexi- 
ble, interactive neural networks. Therefore, it is important to identify and assess, as far as possible, both 
basic and complex subcomponents that contribute to performance within and across functional 
domains. The designers of the NEPSY capitalized on Luria’s approach: Some subtests were designed to 
assess basic subcomponents of a complex capacity within single functional domains; other subtests 
were designed to assess subcomponents of cognitive functions that require contributions from several 
functional domains.

The NEPSY was standardized on a sample of 1,000 children (100 children in each of 10 age groups 
ranging in age from 3 through 12 years). The sample was representative of the U.S. general population 
on four indicators (gender, race/ethnicity, geographical region, and parent education level). All NEPSY
subtest scores are scaled to a mean of 10 and a standard deviation of 3.
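
As a rough illustration of what a scale with a mean of 10 and a standard deviation of 3 implies, the sketch below converts a raw score into a scaled score by a linear transformation. This is a simplifying assumption: published batteries derive scaled scores from age-specific norm tables, and the function name and the normative values used here are hypothetical.

```python
def scaled_score(raw: float, norm_mean: float, norm_sd: float,
                 target_mean: float = 10.0, target_sd: float = 3.0) -> float:
    """Linearly rescale a raw score onto a metric with the given mean and SD."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    z = (raw - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return target_mean + target_sd * z


# Hypothetical norm group: mean raw score 20, SD 4.
# A raw score of 24 lies one SD above the mean.
print(scaled_score(24, norm_mean=20, norm_sd=4))                    # 13.0
print(scaled_score(24, 20, 4, target_mean=100.0, target_sd=15.0))  # 115.0
```

The same function covers the composite metric with a mean of 100 and a standard deviation of 15 by changing the two target parameters.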

Internal-consistency indicators and test-retest reliabilities range from inadequate (.42) to good (.91). 
Validity information on the NEPSY has been accumulated; initial reports (Korkman, Kirk, & Kemp, 
1998) indicate the presence of only low-to-moderate correlations with core tests of cognitive function- 
ing (e.g., WPPSI-R, WISC-III, and BSID-II). 

Qualitative properties of the NEPSY are adequate. Test materials are engaging; tasks are interesting and
stimulating. The NEPSY’s manual is comprehensive and the instructions for administration are clear.
To ensure that test-takers understand the instructions, the test utilizes examples and probes. The application
of a discontinuation rule (i.e., a rule of interrupting a subtest after a certain number of failures)
helps avoid frustration in children.

The NEPSY is one of the newest multi-domain batteries available on the market. It was initially created 
in Finnish. A parallel version was then developed in English. The battery is currently being scrutinized 
by a number of evaluators in different studies.

The Griffiths Mental Development Scales. The Griffiths Mental Development Scales are most popular 
outside of North America. The Griffiths consists of two tests: the Abilities of Babies (Griffiths, 1954), 
designed for infants from birth to 24 months, and the Abilities of Young Children (Griffiths, 1979), for 
children 24 months to 8 years old. The infant scale consists of five domains, modeled closely after the 
scales in Gesell’s early work: Locomotor, Personal and Social, Hearing and Speech, Eye and Hand 
Coordination, and Performance. The test was normed on 571 infants from London, England. Reliability 
studies for the Griffiths have yielded mixed results, and validity studies have indicated relatively weak 
predictive ability for later IQ test scores (Thomas, 1970). 

The Battelle Developmental Inventory. The Battelle Developmental Inventory (BDI; Newborg, Stock,
Wnek, Guidubaldi, & Svinicki, 1984) assesses the development of children from birth through eight 
years of age. The BDI assessment time is longer than that for most similar tests, ranging, depending on 
the age of the child, from 1 to 2 hours. The BDI assesses development in the Personal-Social, Adaptive, 
Motor, Communication, and Cognitive domains. Each BDI domain is also divided into 5 subdomains
(to provide fine-grained information within each domain). When all domains are evaluated, the BDI
produces a total developmental score.

The BDI was normed on a sample of 800 U.S. children. Several psychometric concerns were raised in a 
review of the BDI (McLinden, 1989). First, although the test authors reported exceptionally strong test- 
retest and inter-rater reliability for the BDI, a general lack of procedural details in the manual makes it 
difficult to evaluate these data adequately. Second, there is no information regarding the internal-consistency
reliability of the BDI. Third, the concurrent validity studies were exceptionally small with respect
to sample sizes, and the magnitudes of the correlations were rather low. Fourth, and most important,
researchers have expressed major concerns about the BDI’s normative data (Boyd, 1989). For the first 
two years, the BDI’s normative data are presented in 6-month groups, whereas thereafter they are provid- 
ed in 12-month groups. Therefore, a child’s performance is compared to that of others who can be as 
many as 6 months older or younger for children under 24 months, or as many as 12 months older or 
younger for children older than 24 months. This lack of precision in the normative data tends to inflate 
standard scores for children who are old for their normative groups and to deflate standard scores for 
children who are young for their normative group (Boyd, Welge, Sexton, & Miller, 1989). 
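
The direction of this bias can be shown with a small worked example. The sketch below rests on invented assumptions: raw scores grow linearly with age, the mean of a normative band equals the expected raw score at the band's midpoint, and the standard deviation is fixed. None of these values comes from the BDI; they only illustrate why a wide age band inflates scores for the oldest children in it and deflates scores for the youngest.

```python
def standard_score(raw: float, norm_mean: float, norm_sd: float,
                   mean: float = 100.0, sd: float = 15.0) -> float:
    """Express a raw score on a standard-score metric relative to a norm group."""
    return mean + sd * (raw - norm_mean) / norm_sd


def expected_raw(age_months: float, growth_per_month: float = 2.0) -> float:
    """Hypothetical average raw score at a given age (linear growth assumed)."""
    return growth_per_month * age_months


def band_standard_score(age_months: float, band_start: float, band_end: float,
                        norm_sd: float = 6.0) -> float:
    """Score a child who is exactly average for age against a wide age band."""
    band_midpoint = (band_start + band_end) / 2
    return standard_score(expected_raw(age_months), expected_raw(band_midpoint), norm_sd)


# Three children, each exactly average for their own age, scored against a 12-18 month band.
for age in (12.5, 15.0, 17.5):
    print(age, band_standard_score(age, 12, 18))
```

Under these assumptions only the child at the band's midpoint receives a score of 100; the oldest child scores above 100 and the youngest below it, although all three are developing at the same rate.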

The Mullen Scales of Early Learning (MSEL; Mullen, 1995). The Mullen Scales of Early Learning are
designed to assess children’s development from birth to the age of 68 months. The Mullen takes about 
15 to 60 minutes to administer, depending on the age of the child. The theoretical basis of the Mullen is 
a model of infant neurodevelopment, according to which the child should be assessed in five different 
domains: Gross Motor, Visual Reception (primarily visual discrimination and memory), Fine Motor, 
Receptive Language, and Expressive Language. The domain data can be combined into the overall 
Early Learning Composite score. The composite score based on the four cognitive scales (all but the 
Gross Motor Scale) represents so-called general intellectual ability. 

Normative data for the Mullen are based on a sample of 1,849 children from the U.S.A. Internal-consistency
reliabilities range from .75 to .83 for Mullen subtests and the reliability is .91 for the Early
Learning Composite. Test-retest reliabilities range from .78 to .96, depending on the subtest. Inter-rater
reliabilities range from .94 to .98. The indicators of concurrent validity are adequate. 

The qualitative characteristics of the MSEL are adequate. The test contains many stimulating items that 
engage and maintain the interest of young children. The manual provides detailed descriptions of all 
manipulations, of scoring, and of how to interpret results. 

The Peabody Picture Vocabulary Test-Revised (Dunn & Dunn, 1981). The Peabody Picture Vocabulary 
Test-Revised (PPVT-R) was originally developed in 1959 and then was revised in 1981. It is a nonver- 
bal, multiple-choice test designed to evaluate the receptive vocabulary (assessed through hearing and by 
indicating “yes” or “no”) of children and adults. The test is administered to individuals from age 2 1/2 
years through adulthood. The PPVT-R requires no reading skill. The test is untimed. Testing time is 10- 
15 minutes. 

The PPVT-R words were selected to be of equal numbers of nouns, gerunds, and modifiers in approxi- 
mately 19 content categories. The words also were carefully selected to avoid gender, culture, religion, 
or race biases. The PPVT-R has two forms, L and M, with 175 plates in each form. Each plate contains 
four pictures. Items are arranged in increasing levels of difficulty. The two forms use different words in
pictures. The pictures are clearly drawn, and are free of fine details and interferences of background
with the key stimulus or figure of the picture. Raw scores are converted into standard scores, with a mean
of 100 and a standard deviation of 15.

Normative data were collected from a representative U.S. sample of 4,200 youths (aged 2 1/2 years 
through 18 years), and 828 adults (aged 19 through 40 years). Estimates of PPVT-R internal consisten- 
cy, alternate-form reliabilities, and split-half reliabilities are somewhat lower than desirable, with most 
estimates between .75 and .85 (Sattler, 1992).

A number of studies have attempted to validate the PPVT-R. Specifically, the range of the PPVT-R/WISC-R
correlations is .16 to .86; the range of correlations between the PPVT-R and various measures
of reading, language, and general achievement is .30 to .63 (for details, see Sattler, 1992). There
is an ongoing discussion in the literature about what it is the PPVT-R measures: The test has been 
described as measuring language ability, verbal comprehension, vocabulary ability, receptive language, 
recognition vocabulary, verbal intelligence, vocabulary comprehension, vocabulary usage, comprehension
of single words, single-word hearing vocabulary, single-word receptive vocabulary, and intelligence.
Overall, the consensus is that PPVT-R scores are not interchangeable with IQs.

A number of PPVT-R items have been found to be culturally biased against ethnic minorities 
(Argulewicz & Abel, 1984; Reynolds, Willson, & Chatman, 1984). Therefore, the PPVT-R should be 
applied with caution in populations other than North American native speakers of English. 

Behavior Rating Scales 

Behavior rating scales typically require evaluation of a number of specific behaviors or correlates of 
behavior organized into empirically-derived factors or scales. These scales are expected to provide 
information about a child across social-emotional or adaptive behavior dimensions. Typically, such 
scales have originated from literature reviews and/or the assessment of behavioral peculiarities of chil- 
dren who are representative of the target population of the rating scale. These descriptors are then fac- 
tor-analyzed and the resulting scales are evaluated for their psychometric soundness (i.e., reliability and 
validity). 

A typical administration of a behavioral rating scale includes respondents’ (usually parents’ and teach- 
ers’) ratings of the degree to which certain behavioral descriptors are present in the child’s behavior 
(e.g., “Does your child tease other children? /often, sometimes, never”). The responses are summed up 
across the behavioral factors and then compared to some standardization sample or reference group. 
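
The scoring procedure just described (summing ratings within each factor and comparing the sums with a reference group) can be sketched as follows. The item names, the factor assignments, the numeric values attached to the response options, and the reference mean and standard deviation are all invented for illustration and do not come from any published scale.

```python
# Hypothetical mapping of response options to item scores.
RATING_VALUES = {"never": 0, "sometimes": 1, "often": 2}


def factor_raw_scores(responses, factor_items):
    """Sum the item ratings that belong to each behavioral factor."""
    return {
        factor: sum(RATING_VALUES[responses[item]] for item in items)
        for factor, items in factor_items.items()
    }


def deviation_from_reference(raw_score, reference_mean, reference_sd):
    """Express a raw factor score in standard-deviation units of a reference group."""
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return (raw_score - reference_mean) / reference_sd


# Illustrative data only: item names, factors, and reference values are invented.
responses = {"teases_others": "often", "breaks_toys": "sometimes", "worries": "never"}
factor_items = {"externalizing": ["teases_others", "breaks_toys"],
                "internalizing": ["worries"]}

scores = factor_raw_scores(responses, factor_items)
print(scores)
print(deviation_from_reference(scores["externalizing"], reference_mean=1.0, reference_sd=1.0))
```

In practice the comparison step relies on the norm tables supplied with each instrument rather than on a single mean and standard deviation.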

Child Behavior Checklist. The Child Behavior Checklist exists in five different forms: the Child Behav- 
ior Checklist (ages 2 to 3), the Child Behavior Checklist (ages 4 to 18), the Teacher’s Report Form 
(ages 5 to 18), the Youth Self-Report (ages 11 to 18), and the Direct Observation Form (ages 5 to 14). 
Of these instruments, only the first two forms, completed by parents, can be used with preschoolers. 

The Child Behavior Checklist for ages 2 to 3 (CBCL/2-3; Achenbach, 1991) consists of 99 items and 1 
open-ended item describing various behaviors, emotional problems, or reactions to specific situations. 
The CBCL items are rated on a 3-point scale (2 = Very true or often true; 1 = Somewhat true or sometimes
true; and 0 = Not true at the present time or over the last two months). The CBCL/2-3 has three
global scales (Total Problems, Internalizing, and Externalizing) and six narrow-band scales (Social
Withdrawal, Depressed, Sleep Problems, Somatic Problems, Aggressive, and Destructive). 

The Child Behavior Checklist for ages 4 to 18 (CBCL/4-18; Achenbach, 1991) is a standardized parent- 
report measure of children’s adaptive competencies and problem behaviors that is widely used interna- 
tionally in clinical and research settings. The measure consists of 20 competence items and 118 items 
describing behavioral/emotional problems. The total Competence scale includes scores on the scales of 
Activities, Social, and School. The problem behaviors are scored on eight factor-based, narrow-band 
scales: Withdrawn, Somatic Complaints, Anxious/Depressed, Social Problems, Thought Problems,
Attention Problems, Delinquent Behavior, and Aggressive Behavior. In addition, two broad-band scales 
can also be scored: Internalizing and Externalizing. Finally, there is also a Total Problem Score that 
provides an overall index of the number and severity of reported problem behaviors. 

The CBCL manual (Achenbach, 1991) contains details regarding standardization, factor analyses, and 
other psychometric properties. The CBCL/2-3 and CBCL/4-18 appear to be psychometrically strong. 
The scale statements are written at the fifth-grade reading level. The completion of the scales usually 
takes about 20 minutes. The CBCL scales are important instruments to be used in a preschool assess- 
ment battery. They evaluate children’s competencies and problem behaviors and provide additional 
information that cannot be obtained through cognitive testing. 

The Conners’ Behavior Rating Scales. The Conners’ Behavior Rating Scales consist of the Conners’
Parent Rating Scales (CPRS) and the Conners’ Teacher Rating Scales (CTRS). These scales were origi- 
nally developed to identify hyperactive children but have since been expanded to identify children with 
other, related behavioral problems. Each scale is available in two forms, short and long. The CTRS 
items are rated along a 0 to 3 scale (0 = Not at All; 1 = Just a Little; 2 = Pretty Much; 3 = Very Much).
The parent-version items are structured into five scales: Conduct Problem, Learning Problem, Psycho- 
somatic, Impulsive-Hyperactive, and Anxiety. The teacher-version items form three scales: Conduct 
Problem, Hyperactivity, and Inattentive-Passive. Psychometric properties of the CPRS range, for differ- 
ent scales, from poor to adequate. 

The Vineland Adaptive Behavior Scales. The Vineland Adaptive Behavior Scales (VABS; Sparrow, 
Balla, & Cicchetti, 1984) exist in three versions (the Survey Form, the Expanded Form, and the Classroom
Edition). This instrument is used to assess the ability of handicapped and nonhandicapped children
to perform the daily activities required for personal and social sufficiency from birth
through 19 years of age. The VABS (all three versions) measures adaptive behavior using four specific
domains: Communication, Daily Living Skills, Socialization, and Motor Skills (with this last scale
administered only to children 0-6 years of age). Each of the four primary VABS domains contains specific
subdomains: The Communication domain is divided into receptive, expressive, and written subdomains;
the Daily Living Skills domain is divided into personal, domestic, and community subdomains;
the Socialization domain is divided into interpersonal relationships, play and leisure time, and coping 
skills subdomains; and the Motor Skills domain is divided into gross and fine motor subdomains. The 
subscale scores are added up to yield an Adaptive Behavior Composite. The Survey and Expanded 
Forms of the VABS have an optional Maladaptive Behavior subscale. The Survey Form consists of 297 
items and takes up to 60 minutes to administer; the Expanded Form consists of 577 items and takes 
about 90 minutes to administer. Both forms are administered as semi-structured interviews with a parent
or significant caregiver. The Classroom Edition consists of 244 items and takes about 20 minutes to
complete. This edition is completed by a teacher.

For the Survey and Expanded Forms, the standardization was carried out on a 3,000-individual U.S. 
sample, representative of the general population by gender, race/ethnicity, community size, geographi- 
cal region, and SES. A similarly stratified sample of 3,000 students was used for the Classroom 
Edition. The median split-half reliabilities of the VABS range from .83 to .95; test-retest reliabilities
are in the .80s and .90s; inter-rater reliabilities range from .62 to .75. Validity indicators have been 
found to be satisfactory (e.g., Atkinson, Bevc, Dickens, & Blackwell, 1992). However, it appears that, 
for preschoolers, some of the domains may have inadequate floors (Knoff, Stollar, Johnson, & 
Chenneville, 1999). 

Theories-of-Cognition-Based Tests

The majority of intelligence tests have been developed within the psychometric paradigm, an approach 
based on the identification of abilities (verbal and spatial abilities, memory, reasoning, etc.) through the 
factor analysis of sets of diverse cognitive tasks. Most modern psychometric tests (but not all, as
shown below) address both a general factor (the so-called g-factor, reflecting the positive manifold of
correlations between various cognitive abilities) and distinct, though correlated, group factors. Whereas
all of the psychometric tests have a full-scale or a composite index that, presumably, reflects the
g-factor, no single test completely overlaps with any other test in terms of all the cognitive abilities
that are measured. 

McGrew and Flanagan (1996) conducted a review of tests of intelligence for young children and classified
the subtests of all major intelligence batteries according to the Horn-Cattell Gf-Gc theory (Horn,
1991, 1994; Horn & Noll, 1997) and the Three-Stratum Theory of Cognitive Abilities (Carroll, 1993, 
1997). Presumably, such a classification provides a unified theoretical scheme for comparing and con- 
trasting different tests of intelligence (Alfonso & Flanagan, 1999). 

The Wechsler Preschool and Primary Scale of Intelligence— Revised. The Wechsler Preschool and Pri- 
mary Scale of Intelligence — Revised (WPPSI-R; Wechsler, 1989) is the most recent version of a test 
that was initially developed in the late 1960s (Wechsler, 1967). The test is an individually administered
clinical instrument for assessing the intelligence of children aged 3 years through 7 years, 3 months. 

The WPPSI-R is organized so that one group of subtests (Information, Comprehension, Arithmetic, 
Vocabulary, Similarities, and Sentences) yields a Verbal IQ and another group of subtests (Object 
Assembly, Geometric Design, Block Design, Mazes, Picture Completion, and Animal Pegs) yields a 
Performance IQ. The Verbal and Performance IQs combine to yield the Full Scale IQ, which is inter- 
preted as a measure of so-called general intellectual functioning. Descriptions of all WPPSI-R subtests 
can be found in Gyurke (1991) and in Wechsler (1989). 

The WPPSI-R subtests have a mean of 10 and a standard deviation of 3. The Verbal, Performance, and 
Full scales have a mean of 100 and a standard deviation of 15. Administration time for the WPPSI-R is 
usually somewhat greater than one hour. 

Overall, the WPPSI-R displays adequate standardization characteristics, with a total standardization sample
of 1,700 individuals, 400 per one-year interval. The sample is representative of the general U.S.
population with respect to gender, race/ethnicity, geographic location, and SES. The WPPSI-R norms
were collected in 1984-1985; thus, current WPPSI-R scores may slightly overestimate cognitive ability. 

The total internal-consistency reliability coefficients of the WPPSI-R are .95 or greater. The WPPSI-R
test-retest reliability is estimated at .91, but the procedures used to obtain it are considered inadequate, 
given that the test-retest reliability sample comprised children past the preschool age. 

The WPPSI-R’s floors are inadequate at the early age levels of the test (i.e., 2 years and 11 months of 
age and slightly above). All subtests have adequate floors only by the age of 4 years and 9 months. The 
subtest item gradients are generally adequate. 

The WPPSI-R has adequate validity. Researchers conducted several factor-analytic studies on the 
WPPSI-R data (e.g., Allen & Thorndike, 1995; Stone, Gridley, & Gyurke, 1991) and consistently 
obtained a two-factor solution, supporting the claim that a Verbal-Performance dichotomy underlies the 
WPPSI-R.

Ample validity data are available. Dozens of studies have been conducted that compare the WPPSI-R
with other major intelligence tests, such as the Fourth Edition of the Stanford-Binet (SB-IV; median
r = .78, Thorndike et al., 1986b), the McCarthy Scales of Children’s Abilities (r = .86, Sattler, 1992),
the Woodcock-Johnson Revised (WJ-R; e.g., r = .70, Harrington, Kimbrell, & Dai, 1992), and the Differential
Ability Scales (DAS; e.g., r = .74, Elliott, 1990b). In general, correlations
between total scores on the WPPSI-R and scores on other instruments have been moderate to high
(McGrew & Flanagan, 1996).

The qualitative characteristics of the WPPSI-R range from poor to adequate (Table 2). The WPPSI-R 
Manual (Wechsler, 1989) provides adequate information about the development of the instrument, its 
underlying constructs, administration and scoring procedures, and interpretations. The test materials are
engaging and are likely to attract the attention of a preschool child; moreover, some subtests include
stimulating initial tasks (e.g., Object Assembly). The WPPSI-R nicely alternates verbal and nonverbal
subtests. Moreover, this test successfully utilizes elements of dynamic testing by including teaching 
items, second trials, and demonstrations by the examiner. 

However, the WPPSI-R has a number of drawbacks. The major one is the length of the test: Many 
young children cannot remain focused and attentive during the entire administration of the test. The 
WPPSI-R does not provide alternative stopping rules, which makes the test rather frustrating for young 
children. In addition, many WPPSI-R subtests rely heavily on extensive expressive language skills 
(e.g., Comprehension and Vocabulary). Moreover, the WPPSI-R does not build in gestures as possible 
answers. In addition, the WPPSI-R directions are complex and unnecessarily include many basic con- 
cepts (up to 42, according to Flanagan et al., 1995). 

There are no core directions on how to translate and adapt the test to cultures other than the North 
American one, and there are no norms for children from other cultures. As for the degree of
North American acculturation necessary for successful performance on the test, it is assumed
to be high for the subtests constituting the Verbal scale, and moderate-to-low for the subtests constitut- 
ing the Performance scale. Thus, the WPPSI-R is of limited utility in the evaluation of children from 
diverse cultural backgrounds.

Overall, the WPPSI-R possesses strong psychometric characteristics; in many respects, it is quantitatively
the strongest instrument in the field of early child assessment (Kamphaus, 1993). Yet, the test is
too long, its administration is too complex, it does not adequately handle failure-related frustration, the
acceptability of answers is heavily dependent on the child’s expressive language, and it is over-reliant
on the values of North American culture.

The Wechsler Intelligence Scale for Children-Third Edition. The Wechsler Intelligence Scale for Children-Third
Edition (WISC-III; Wechsler, 1991) is the most current edition of a test that was initially
developed in the late 1940s (Wechsler, 1949). This test is an individually administered clinical instru- 
ment for assessing the intellectual abilities of children aged 6 years through 16 years, 11 months. The 
instrument consists of three main composite scores: Verbal IQ (comprising Information, Similarities, 
Arithmetic, Vocabulary, Comprehension, and Digit Span subtests), Performance IQ (comprising Picture 
Completion, Coding, Picture Arrangement, Block Design, Object Assembly, Symbol Search, and Mazes 
subtests), and Full Scale IQ. Although based originally on a conception of intelligence that emphasized 
the pervasive nature of so-called general intelligence, the current edition of the WISC offers scores for 
four factors (Verbal Comprehension, Perceptual Organization, Processing Speed, and Freedom From 
Distractibility). 

The Stanford-Binet Intelligence Scale: Fourth Edition. The Stanford-Binet Intelligence Scale: Fourth
Edition (SBIS-IV; Thorndike, Hagen, & Sattler, 1986a, 1986b) is an individually administered
intelligence test used to assess the cognitive abilities of individuals from age 2 years to adult. The Fourth
Edition is the latest version of the Stanford-Binet, which was originally published in 1916. The
SBIS-IV is based on a three-level hierarchical model consisting of “g” (a general reasoning factor) and three
second-order factors (Crystallized Abilities such as Verbal Reasoning and Quantitative Reasoning, 
Abstract/Visual Reasoning, and Short-Term Memory). The Verbal Reasoning area score is derived from 
the Vocabulary, Comprehension, and Absurdities subtests; the Abstract/Visual Reasoning score is 
derived from the Pattern Analysis and Copying subtests; the Quantitative area score is derived from the 
Quantitative subtest; and the Short-Term memory score is based on the Bead Memory and Memory for 
Sentences subscores (for subtest descriptions, see Delaney & Hopkins, 1987; Glutting & Kaplan, 1990). 
Scores from one or more of the SBIS-IV areas are combined to yield the Test Composite (a measure of g).

SBIS-IV subtest scores have a mean of 50 and a standard deviation of 8. Area scores and the Composite
score have a mean of 100 and a standard deviation of 16. The SB-IV was standardized on 5,000
individuals, with at least 200 individuals per one-year interval. The standardization sample was, as a whole,
representative of the U.S. population (in terms of gender, geographic region, race/ethnicity, and
community size), although low-SES participants were somewhat underrepresented and high-SES participants
overrepresented; no data on age-specific representativeness are available.
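
Because the instruments reviewed here report scores on different metrics (for example, a subtest mean of 50 with a standard deviation of 8, or a composite mean of 100 with a standard deviation of 16 or 15), it can help to see how a score on one metric might be re-expressed on another. The sketch below is illustrative only. It assumes that both scales describe the same normally distributed reference population, so that a score can be carried across through its z-score; the names ScoreScale and convert are hypothetical, and in practice conversions rely on each test's published norm tables rather than on a linear formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreScale:
    """A standard-score metric defined by its normative mean and SD (hypothetical helper)."""
    name: str
    mean: float
    sd: float

    def to_z(self, score: float) -> float:
        # Distance from the normative mean, in standard-deviation units.
        return (score - self.mean) / self.sd

    def from_z(self, z: float) -> float:
        # Re-express a z-score on this scale's metric.
        return self.mean + z * self.sd


def convert(score: float, source: ScoreScale, target: ScoreScale) -> float:
    """Linear re-expression of a score; assumes comparable norm groups."""
    return target.from_z(source.to_z(score))


# Metrics taken from the text: SB-IV subtests (50, 8) and composites (100, 16).
SB4_SUBTEST = ScoreScale("SB-IV subtest", 50, 8)
SB4_COMPOSITE = ScoreScale("SB-IV area/composite", 100, 16)
# A composite metric with SD 15, as used by several other batteries in this review.
DEVIATION_15 = ScoreScale("composite, SD 15", 100, 15)

# A subtest score one SD above the mean sits at 116 on the SB-IV composite metric.
print(convert(58, SB4_SUBTEST, SB4_COMPOSITE))    # 116.0
# The same relative standing corresponds to 115 on an SD-15 metric.
print(convert(116, SB4_COMPOSITE, DEVIATION_15))  # 115.0
```

The one-point difference between 116 and 115 shows why composite scores from instruments with different standard deviations should not be compared at face value.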

The total test internal-consistency reliability coefficients are good, with the lowest coefficient being .95. 
The test-retest reliabilities are also good (.91), but may be biased because the sample suffers from non- 
representativeness. The most significant psychometric limitations of the SBIS-IV are subtest and area 
floors and item gradients. All subtests (and, correspondingly, all areas) appear to have inadequate floors 
for 0-2-year-olds and the floors become adequate only at about age 5; thus, the SB-IV is inadequate for 
the assessment of very young children. The item gradients are inadequate for six of the eight subtests 
for preschoolers (i.e., Comprehension, Absurdities, Bead Memory, Quantitative, Copying, and Pattern
Analysis) at ages 2.6 to 3.5; thus, fine gradations in ability may not be detected on most subtests of the
SB-IV, especially in the case of very young children (Alfonso & Flanagan, 1999). 

Overall, the validity of the SB-IV is considered adequate. However, the quality of the construct-validity 
evidence has been questioned (Glutting & Kaplan, 1990; Kaplan & Alfonso, 1997). Specifically, 
although there are convergent data suggesting that the Test Composite of the SB-IV can be interpreted 
as a measure of general intelligence, there is considerable controversy regarding the factor structure of 
the instrument (in terms of both factor number and factor loadings). Out of many studies attempting to 
investigate the factor structure of the SB-IV, not a single one has supported the test authors’ claim of its 
invariance across age (e.g., Keith, Coll, Novak, White, & Pottebaum, 1988; Kline, 1989; Molfese,
Yaple, Helwig, Harris, & Connell, 1992). Despite the lack of agreement regarding the construct validity
of the SB-IV across the age range of the test, however, there is general consensus regarding the two- 
factor structure of the SB-IV for preschoolers (e.g., Molfese et al., 1992; Ownby & Carmin, 1988), with 
Vocabulary, Comprehension, Absurdities, and Memory for Sentences forming a Verbal Comprehension 
factor and Pattern Analysis, Copying, Quantitative, and Bead Memory forming a Nonverbal Reasoning/ 
Visualization factor (Sattler, 1992). As for criterion validity, dozens of studies have been carried out 
comparing the SB-IV with other major intelligence tests, such as the WPPSI-R (e.g., Carvajal et al., 
1988; McCrowell & Nagle, 1994), WISC-R (e.g., Brown & Morgan, 1991; Phelps, Bell, & Scott,
1988), K-ABC (e.g., Lamp & Krohn, 1990; Rothlisberg & McIntosh, 1991), and others (for more
detail, see Appendix). In general, correlations between the total scores of the SB-IV and other instru- 
ments have been moderate to high (McGrew & Flanagan, 1996) and test scores for exceptional groups 
(e.g., learning disabled, gifted) have been essentially similar (Sattler, 1992). Therefore, the SB-IV 
appears to be a valid measure of many aspects of intellectual functioning for children of most ages and 
for a variety of exceptional subpopulations. 

The qualitative characteristics of the SB-IV are adequate. The theoretical foundation of the test and its
psychometric properties are presented in the SB-IV kit’s manuals (Thorndike et al., 1986). Guidelines for
interpreting individual results and specific details of administration are provided in a stand-alone hand- 
book (Delaney & Hopkins, 1987). The SB-IV contains attractive manipulative materials, but because 
they are downward extensions of test items for older children, they could use some modification that 
would result in a higher degree of engagement of younger children. 

The SB-IV is effective in alternating verbal and nonverbal subtests. The completion time, on average, is 
an hour. However, the administration of some subtests is rather awkward, sometimes requiring a change
of stimulus material or a shift in instructions. The SB-IV requires minimal expressive language, but
gestures are not acceptable responses. The instructions are lengthy and use a large number of basic
concepts (up to 25; Alfonso & Flanagan, 1999). Only four of the eight subtests of the SB-IV for
preschoolers utilize elements of a dynamic-testing approach by including sample items and providing young
children an opportunity to learn the task. In addition, the test does not have alternative stopping rules 
(e.g., the option of stopping the test after an established number of consecutive failures). 

The Verbal Reasoning cluster subtests are highly sensitive to North American acculturation. The Short- 
Term Memory subtest is moderately sensitive, and the Abstract/Visual Reasoning subtests are only 
slightly sensitive.

The test directions of the SB-IV and children’s verbal responses are not commercially available in
languages other than English. There are no norms for individuals from cultures outside the U.S.A.

Overall, the SB-IV is characterized by adequate psychometric indicators. It is considered to be a valid
measure of so-called general intellectual functioning. The data on construct validity are somewhat
inconsistent and the factor structure of the instrument is an unresolved question. The test is
well-described (i.e., its theoretical framework is clear and its structure is transparent), has minimal
expressive language requirements, and takes about 60 minutes to administer. But the instructions are wordy
and the administration of the test is sometimes cumbersome. Two main weaknesses of the SB-IV are
the lack of interpretability of area-specific subtests and inadequate floors and item gradients.

The Cattell Infant Intelligence Scale. The Cattell Infant Intelligence Scale (Cattell, 1960) was
conceptualized as a downward extension of the 1937 Stanford-Binet Intelligence Scale. The Cattell was
designed to assess infants and toddlers from 2 through 30 months old. In constructing the instrument,
Cattell relied heavily on Gesell’s work, using the same or similar items as did Gesell in his instrument. To
match the g-factor paradigm closely, however, items addressing gross motor and personal-social
development were excluded. The reliability indicators were adequate, but the predictive validity
was low (the correlations between the Cattell and the SBIS at 36 months were very low for younger infants,
but somewhat higher for 24- to 30-month-olds; Thomas, 1970). Overall, though designed specifically as
an extension of the g-based tests to early childhood, the Cattell appears unable to improve the
predictive power of indicators of cognitive development in early childhood.

The McCarthy Scales of Children’s Abilities. The McCarthy Scales of Children’s Abilities (McCarthy,
1972) form a well-standardized and psychometrically sound measure of the cognitive abilities of young
children (ages 2 1/2 to 8 1/2 years). The test is individually administered and takes about 45 to 60
minutes to administer, depending on the age of the child. The McCarthy Scales have some unique features
valuable for the assessment of young children with learning problems or other exceptionalities (Sattler, 
1992). The test produces a general measure of intellectual functioning called the General Cognitive 
Index (GCI), as well as a profile of abilities that includes measures of verbal ability, nonverbal reason- 
ing ability, number aptitude, short-term memory, coordination, and hand dominance. 

The scale indices derived from the McCarthy Scales subtests are standard scores, with a mean of 50 and
a standard deviation of 10. The overall General Cognitive Index has a mean of 100 and a standard
deviation of 16. This index is considered to be an indicator of the child’s ability to integrate his or her
accumulated knowledge and to adapt that knowledge in order to perform the tasks on the scales.

The standardization of the McCarthy Scales was excellent (N = 1,032). The sample was representative
of the general population of the U.S.A. (on the variables of age, sex, race and ethnicity, geographic 
region, father’s occupation, and urban-rural residence). 

The psychometric properties of this scale are also very good, with median split-half,
internal-consistency, and test-retest reliabilities ranging between .85 and .93 (Sattler, 1992). The concurrent
validity of the McCarthy Scales against the Stanford-Binet, WISC, WISC-R, WPPSI, and K-ABC is acceptable
(with correlations ranging from .45 to .90). Construct validity, however, appears to be questionable, with
different numbers of factors revealed in different studies and for boys and girls.

The McCarthy Scales’ Manual is comprehensive and easy to use. Materials are well-constructed and
appeal to children.

The Differential Ability Scales. The Differential Ability Scales (DAS; Elliott, 1990a) form an
individually administered battery of cognitive and achievement tests for children and adolescents from ages 2 1/2
years through 17 years. The Cognitive Battery is organized into a set of core subtests that yield a
General Conceptual Ability (GCA) score and a set of diagnostic subtests that provide additional information
on specific abilities. There is also an intermediate layer of so-called cluster scores, linking specific
subtests to the GCA score. The structure of the test is flexible and age-dependent. Thus, for children aged 2
years, 6 months to 3 years, 5 months, there are no cluster scores because abilities are relatively
undifferentiated at this young age. The GCA score is a function of scores on core subtests (Block Building,
Verbal Comprehension, Picture Similarities, and Naming Vocabulary) and diagnostic subtests (Recall of 
Digits, Recognition of Pictures). For children aged 3 years, 6 months, to 5 years, 11 months, the ability 
clusters are verbal (core subtests are Verbal Comprehension and Naming Vocabulary) and nonverbal 
(core subtests are Picture Similarities, Pattern Construction, and Copying), and the diagnostic subtests
are Block Building, Matching Letter-Like Forms, Recall of Digits, and Recognition of Pictures. For
children aged 6 years to 17 years, 11 months, the clusters are verbal ability (including core subtests of
Word Definition and Similarities), nonverbal reasoning ability (core subtests of Matrices, and
Sequential and Quantitative Reasoning), and spatial ability (including subtests of Recall of Designs and Pattern
Construction). The diagnostic subtests are Recall of Digits, Recall of Objects, and Speed of Information
Processing. The GCA is viewed as providing an estimate of so-called general intelligence (Elliott,
1990b).

Elliott does not provide any explicit theoretical framework underlying the DAS. McGrew and Flanagan
(1996), however, view this instrument as yet another realization of the Gf-Gc theory.

All DAS subtests have a mean of 50 and a standard deviation of 10, whereas all composites have a
mean of 100 and a standard deviation of 15. The administration time ranges between 35 and 60
minutes; the time is determined by whether the diagnostic subtests are administered (Elliott, 1990a).

The DAS was standardized on a sample representative of the general U.S. population
(by gender, geographic region, race/ethnicity, enrollment in educational programs, and SES). The
sample included 3,475 individuals, with 200 to 350 individuals per one-year interval. The DAS norm tables
are divided into blocks of 3 months. The DAS was standardized in 1987-1988; therefore, these norms
are still adequate, although barely so.

The DAS demonstrates adequate reliability and validity. Internal-consistency reliability coefficients are 
.90 or higher across the preschool range, with somewhat lower coefficients (.89) for children between 
the ages of 0-3 and 4-6 years. Test-retest reliability, evaluated on a representative sample
with an interval of four weeks, was .90.

All DAS subtests but Verbal Comprehension have demonstrated adequate floors; the Verbal
Comprehension subtest floor becomes adequate by the age of 4 years and 4 months. However, all DAS
composites have adequate floors across the preschool age range. The DAS generally has adequate item
gradients at the middle and upper end of the preschool range, but two subtests (Block Design and Naming
Vocabulary) have inadequate item gradients at the lower preschool age.

Factor analyses (e.g., Keith, 1990) suggest a one-factor (g-factor) solution for the DAS subtest data
collected on the youngest children, and a two-factor solution for the data collected on older preschoolers
(starting at the age of 3 years and 6 months). As for criterion validity, dozens of studies have been
carried out comparing the DAS with other major intelligence tests, such as the WPPSI-R, the
Woodcock-Johnson (WJ-R), and the Kaufman Assessment Battery for Children (K-ABC). In general, correlations
between the total scores on the DAS and other instruments have been moderate to high (McGrew &
Flanagan, 1996). Therefore, the DAS appears to be a valid measure of intellectual functioning for most
ages and for a variety of exceptional subpopulations.

The qualitative characteristics of the DAS range from adequate to good (Table 2). The manual contains 
sufficient information on the theoretical background, administration, scoring, interpretation, and psy- 
chometric properties of the test. The DAS materials are age-appropriate, stimulating, and engaging. To 
involve children in the process of testing, the DAS alternates verbal and nonverbal subtests, beginning 
with stimulating tasks. Many of the DAS subtests require minimum expressive language skills, and ges- 
tures are acceptable responses. The DAS utilizes elements of dynamic testing by including sample and 
teaching items, second trials, and demonstrations to ensure that the child has understood the task. 
Another positive characteristic of the DAS is the inclusion of stopping rules. The main limitation of the
test is the length of its directions and the high number of basic concepts they use (23; Alfonso & Flanagan, 1999).

The DAS has not been translated into languages other than English and does not include a system of
core assumptions for translations. There are no norms available for individuals from cultures other than
the North American one.

As with other g-based tests of intelligence, the importance of North American acculturation for the
DAS subtests ranges from high to low. The Verbal subtests (i.e., Verbal Comprehension and Naming
Vocabulary) are highly dependent on exposure to North American culture, whereas the subtests
comprising the Nonverbal cluster (i.e., Copying, Pattern Construction, and Block Building) and the Special
Nonverbal Composite (Block Building and Picture Similarities) are perhaps somewhat less susceptible
to the impact of culture.

Overall, the DAS, more than any other g-based instrument, appears to achieve a balance between good 
quantitative and qualitative indicators and, therefore, is highly regarded by professionals in the field of 
early childhood assessment (Alfonso & Flanagan, 1999). 

The Woodcock-Johnson Psycho-Educational Battery-Revised: Tests of Cognitive Ability. The
Woodcock-Johnson (WJ-R COG; Woodcock & Johnson, 1989) is designed for individuals aged 24 months
through 95+ years. The battery contains 21 tests of cognitive ability divided into standard and
supplemental batteries. The test is specifically based on the Horn-Cattell Gf-Gc theory (Horn, 1991, 1994;
Horn & Noll, 1997) and the Three-Stratum Theory of Cognitive Abilities (Carroll, 1993, 1997), so the
standard battery contains seven tests, one measure for each of seven Gf-Gc factors (among them Crystallized
Ability, Short-Term Memory, Visual Processing, Auditory Processing, and Long-Term Retrieval). The
supplemental battery contains 14 tests. Of these, the first 7 tests (i.e., tests 8 through 14) provide
complementary measures of the seven Gf-Gc Cognitive Ability Clusters. The remaining seven tests on the
WJ-R COG supplemental battery (tests 15-21) provide mixed measures of Gf-Gc abilities and may be
administered to derive additional information about an individual’s cognitive strengths and weaknesses.
Of the WJ-R COG’s 21 subtests, only five [Picture Vocabulary (a test of Crystallized Ability), Memory
for Sentences (a test of Short-Term Memory), Visual Closure (a test of Visual Processing), Incomplete
Words (a test of Auditory Processing), and Memory for Names (a test of Long-Term Retrieval)] are 
applicable to children between the ages of 2 years and 5 years, 11 months. Combined, scores on these 
subtests yield the Broad Cognitive Ability Early Development (BCA-ED) cluster, which is interpreted 
as an estimate of general intellectual functioning (Woodcock & Mather, 1989). 

The WJ-R COG subtests have a mean of 100 and a standard deviation of 15. The time required for 
administration of the BCA-ED ranges from 20 to 30 minutes. 

The WJ-R COG standardization was carried out on a representative sample (approximating the U.S.
population on gender, geographic region, race/ethnicity, community size, and SES) comprising 6,359
individuals with at least 100 individuals per one-year interval. However, only 705 children were
included in the preschool sample, and no specific age-based information was reported on this sample. The
norm table of the WJ-R COG is divided into one-month age blocks for children between the ages of 2
and 5 years, 11 months.

The WJ-R internal consistency estimates are good (> .93 across the preschool age range). The test- 
retest reliability sample included very few preschoolers and, therefore, is considered inadequate. How- 
ever, the overall test-retest reliability indicator for the WJ-R COG for this sample is .87. All BCA-ED 
subtests (except one: Incomplete Words) have adequate floors and item gradients throughout all 
preschool years. Therefore, the BCA-ED not only provides estimates of ability in the borderline range
(i.e., at least 2 standard deviations below the normative mean) for very young children and for those in
the middle and upper end of the preschool age range, but it is also sensitive to fine gradations in ability
between the mean and 1 to 2 standard deviations below the mean (Alfonso & Flanagan, 1999).
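
The notion of an adequate floor used throughout this section can be made concrete with a brief sketch. It assumes, for illustration, that a floor counts as adequate when the lowest standard score a child of a given age can obtain lies at least 2 standard deviations below the normative mean; the function name floor_is_adequate and the norm-table values are hypothetical and are not taken from any test manual.

```python
def floor_is_adequate(lowest_standard_score, mean=100.0, sd=15.0, criterion_sd=2.0):
    """Return True if the lowest obtainable score is at least `criterion_sd` SDs below the mean.

    The 2-SD criterion is an assumption made for this sketch.
    """
    z = (lowest_standard_score - mean) / sd
    return z <= -criterion_sd


# Hypothetical lowest obtainable standard scores by age (not real norm-table values).
hypothetical_floors = {"2 years": 78, "3 years": 70, "4 years": 62}

for age, score in hypothetical_floors.items():
    status = "adequate" if floor_is_adequate(score) else "inadequate"
    print(f"{age}: lowest standard score {score} -> floor {status}")
```

With these made-up values the floor is inadequate at age 2 (the lowest score is only about 1.5 standard deviations below the mean) and adequate from age 3 onward, which is the pattern of age-dependent floors described for several of the instruments above.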

There is ample evidence supporting the construct validity of the WJ-R COG, but most of this evidence has
been collected for individuals aged 5 to 80+. Moreover, the BCA-ED Cluster has only one indicator
(subtest) per ability. Therefore, Alfonso and Flanagan (1999) suggest interpreting the battery for
preschoolers as a measure of general cognitive ability (so-called general intelligence). The criterion
validity indicators are good, demonstrating a moderate to high degree of similarity among intelligence
tests for preschoolers (McGrew & Flanagan, 1996).

The qualitative characteristics of the WJ-R COG range from poor to adequate. The manual is
comprehensive, containing the necessary information on the test’s underlying theory, development, psychometric
properties, administration, scoring, and interpretation. However, the test materials do not include
manipulatives and appear to be of rather low interest to children. Moreover, the tests require
expressive-language answers and do not accept gestures as adequate responses.

The WJ-R COG is the only cognitive test for young children that has a parallel form in Spanish
(Batería-R COG; Woodcock & Muñoz-Sandoval, 1996). The instructions are brief and comprehensive,
with a minimum number of basic concepts (12; Alfonso & Flanagan, 1999). The WJ-R COG does not
utilize elements of dynamic testing. Moreover, the test does not include alternative stopping rules. 

As with the other tests discussed here, the WJ-R COG’s dependency on North American acculturation is
distributed unevenly (Hessler, 1993); this dependency is quite pronounced for some subtests (e.g.,
Picture Vocabulary) and less so for others (e.g., Memory for Names).

The WJ-R COG is unique in that it has subtest floors and item gradients that are generally good
throughout the preschool age range, rendering it a sensitive instrument for quantifying fine gradations
in ability across various levels of cognitive functioning. In general, the WJ-R COG is designed to
measure a broader range of abilities than is usually captured by intelligence tests. Unfortunately, these
abilities are underrepresented in the WJ-R COG variant for young children. Moreover, due primarily to the
limited attractiveness of the stimuli for younger children, the lack of discontinuation rules, and a certain
dependency on verbal responses, the test is rarely used in early-child assessment.

The Kaufman Assessment Battery for Children. The Kaufman Assessment Battery for Children
(K-ABC; Kaufman & Kaufman, 1983) measures both intelligence and achievement. It is designed to
assess functioning in both normal and exceptional children of ages 2 1/2 through 12 1/2 years. Four
global indicators of functioning are assessed: Sequential Processing and Simultaneous Processing
(combined into the Mental Processing Composite), Nonverbal Performance, and Achievement (an indicator
of the optimality of the application of the two types of processing in the context of academic-skills
mastery). There are a total of 16 subtests (3 sequential, 7 simultaneous, and 6 achievement), but not all
subtests are administered at every age (no more than 13 are administered to any one child). Only three
subtests (Hand Movements, Gestalt Closure, and Faces and Places) run throughout the ages covered by
the battery. The K-ABC is intended for use in school and clinical settings, with an administration time
of approximately 45 minutes for preschool children and about 75 minutes for school-age
children.

Unlike the other tests, this test has a solid theoretical basis and draws on Alexander Luria’s (1980) theo- 
ry of the functional systems of the brain. The theoretical framework underlying the K-ABC makes a 
distinction between sequential and simultaneous mental processes. Sequential processing refers to the 
child’s ability to solve problems by mentally arranging input in sequential or serial order. This type of 
processing is crucial for such mental operations as learning grammatical relationships and rules,
understanding the chronology of events, and making associations between sounds and letters.
Correspondingly, the Sequential Scale contains three subtests: Hand Movements, Number Recall, and Word Order.
Simultaneous processing refers to the child’s ability to synthesize information in order to solve a prob- 
lem. This type of processing is fundamental to learning the shapes of letters, deriving meaning from 
pictorial stimuli, or determining the main idea of a story. The Simultaneous Scale contains seven sub- 
tests: Magic Window, Face Recognition, Gestalt Closure, Triangles, Matrix Analogies, Spatial Memory, 
and Photo Series. The Achievement Scale contains six subtests: Expressive Vocabulary, Faces and 
Places, Arithmetic, Riddles, Reading/Decoding, and Reading/Understanding. The score on Nonverbal
Performance is composed of the scores of those subtests of the Sequential and Simultaneous
Processing Scales that do not require words (Face Recognition, Hand Movements, Triangles, Matrix
Analogies, Spatial Memory, and Photo Series).

The Sequential and Simultaneous Processing Scales were designed to reduce the effects of verbal
processing. Moreover, the test was intentionally designed to minimize the effects of gender, ethnic, and
North American cultural bias.

Raw scores for the subscales are converted into scaled scores with a mean of 10 and a standard
deviation of 3; the global scales are transformed into standard scores with a mean of 100 and a standard
deviation of 15. The standardization of the K-ABC was adequate; it was conducted on a large North
American sample (N = 2,000). However, the sample shows some representation problems with
Hispanic-Americans. In addition, low-educational-level Blacks were significantly underrepresented. Internal
consistency reliabilities are satisfactory, ranging from .86 to .97 for composite scales. Stability of the
K-ABC as assessed by means of test-retest reliability is also adequate (Sattler, 1992). The K-ABC
was validated against many other tests of children’s cognitive functioning (see Appendix). The
concurrent-validity indicators are satisfactory. Item distribution characteristics are adequate, but the K-ABC
has a low ceiling that may limit its usefulness in evaluating gifted children. Over half of the subtests on
the Simultaneous and Sequential Processing Scales provide maximum scores that are only 2 standard
deviations or less above the mean. The Achievement Scale also has a restricted range.

The K-ABC is recognized as an instrument useful in certain situations, especially those requiring an
emphasis on nonverbal cognitive abilities. However, the K-ABC is not recommended for use as the
primary instrument for identifying the intellectual abilities of normal or special children either in research
or in clinical settings (Sattler, 1982).

The Standard Raven Progressive Matrices (SPM; Raven, 1960), drawing on Spearman’s (1923, 1927)
theory of general ability, consists of 60 matrix problems, which are separated into five sets of 12
designs each. Within each set of 12, the problems become increasingly difficult. Each individual design
has a missing piece. The participant’s task is to select the correct piece to complete the design from
among six to eight alternatives. Correct responses are based on various organizing principles, such as
increasing size, reduced or increased complexity, and number of elements. The SPM uses nonverbal
stimuli, and it is assumed that it does not require a specific knowledge base. A separate test, referred to
as Coloured Progressive Matrices (Raven, 1965), has been developed for children in the 5-11 age range
and the elderly (65+ years of age). Similarly, persons believed to be of high intellectual ability can be
administered the Advanced Progressive Matrices (Raven et al., 1992). The SPM is considered one of
the most reliable instruments for measuring general intelligence, especially its fluid aspects (Court,
1988; Raven, 1989). The latest edition of the tests was published in 1995. This test series generally is
not appropriate for preschoolers.

Criterion/Norm-Referenced Assessment 

Norm-referenced comparison is the most commonly used method of comparison. It is especially preva- 
lent in mental-health assessment and involves comparing particular observed behaviors to those of a 
large representative sample of children. In other words, diagnostic and screening tests are used to com- 
pare a child’s current level of functioning to that of other children. Criterion-based comparison usually 
involves comparing a child’s performance to some set of expectations or set of standards, such as those 
indicative of school readiness. Criterion-based instruments typically include many different items 
intended to reflect all important developmental stages and competencies of various ages. 

The Brigance Diagnostic Inventory of Early Development — Revised. The Brigance (BDIED; Brigance,
1991) is one of the most popular criterion-referenced tests used with young children (from birth to 7
years). The Brigance surveys skills in 12 different developmental domains, including social and
emotional, communicative, motor, and pre-academic skills (reading, math, and manuscript writing). There
is, however, no information regarding the reliability and validity of the BDIED (Bagnato, 1985;
Carpenter, 1994). The BRIGANCE K & 1 Screen (Brigance, 1991) is a shorter version of the BDIED.
Some other criterion-referenced assessments are Developmental Programming for Infants and Young
Children (Schafer & Moersch, 1981), The Hawaii Early Learning Profile (HELP) (Furuno et al., 1987),
and The Early Learning Accomplishment Profile for Infants (Sanford, 1981).

The Infant Psychological Developmental Scale (Uzgiris & Hunt, 1975) is one of the most commonly 
used instruments based upon the Piagetian model. It assesses an infant’s object-permanence ability, 
ability to understand means-ends and cause-effect relationships, ability to imitate vocalizations and ges- 
tures, and ability to manipulate objects in space. 

The Metropolitan Readiness Test (MRT; Nurss & McGauvran, 1986) is currently in its fifth edition. This
is a group-administered test measuring young children’s fundamental competencies in reading, mathe- 
matics, and language-based activities. The administration of the test usually takes 80-90 minutes. 

The MRT’s Level I is designed for preschool children (4- to 6-year-olds). This level includes the following
subtests: Auditory Memory, Beginning Consonants, Letter Reasoning, Visual Matching, School
Language and Listening, and Quantitative Language. Level II is designed for kindergarten graduates and
first-graders. The Level-II subtests are Beginning Consonants, Sound-Letter Correspondence, Visual
Matching, Finding Patterns, School Language, Quantitative Concepts, and Quantitative Operations.

The Peabody Individual Achievement Test-Revised (PIAT-R; Markwardt, 1989) is a norm-referenced test
(ages 5 to 18) of school achievement. It is administered individually and assesses performance in six
content areas (general information, reading recognition, reading comprehension, mathematics, spelling,
and written expression). Level I of the PIAT-R is designed for kindergarteners and first graders and 
includes a number of prewriting skills (e.g., copying and writing letters). The PIAT-R was standardized 
on a representative sample of 1,563 students (K-12) and 175 kindergarteners. Most internal-consistency 
reliability indicators are over .90 with the exception of mathematics at the kindergarten level (.84). The 
criterion validity, as measured by correlations with a number of other tests (Sattler, 1992), is adequate. 

Screening Devices 

When large-scale assessments must be carried out, or when there is a need to determine which children
may be developmentally at risk and require further assessment, assessors may turn to special early
childhood developmental screening devices. Such devices are somewhat predictive of scores from
comprehensive assessments but require substantially less time to administer and score.

Because of their brevity, these instruments are neither as reliable nor as valid as comprehensive
assessment tools. The psychometric properties of developmental screeners are described by the tests’
sensitivity (the proportion of children with delays or disabilities whom the test correctly detects;
the children it fails to detect are “misses”) and specificity (the proportion of developmentally
normal children whom the test correctly passes; those it mislabels as delayed or disabled are “false
alarms”). It has been recommended (Meisels, 1989) that both the sensitivity and the specificity of a
screening device be at least 80%. Although both indicators are important, it is usually assumed that
sensitivity is the more critical characteristic of a screening device: follow-up assessments will
correct false positives, whereas false negatives usually will not be referred for further assessment.
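
To make the two indices concrete, the following minimal Python sketch computes sensitivity and
specificity from a cross-tabulation of screening results against a comprehensive assessment, and
checks them against the 80% level recommended by Meisels (1989). The class and function names are
hypothetical, and the counts are invented purely for illustration; they do not describe any
instrument reviewed here.

```python
from dataclasses import dataclass


@dataclass
class ScreeningCounts:
    """Hypothetical cross-tabulation of a screener against a full assessment."""
    true_positives: int   # delayed children the screener flagged
    false_negatives: int  # delayed children the screener passed ("misses")
    true_negatives: int   # developmentally normal children the screener passed
    false_positives: int  # developmentally normal children flagged ("false alarms")


def sensitivity(c: ScreeningCounts) -> float:
    """Proportion of truly delayed children whom the screener detects."""
    return c.true_positives / (c.true_positives + c.false_negatives)


def specificity(c: ScreeningCounts) -> float:
    """Proportion of developmentally normal children whom the screener passes."""
    return c.true_negatives / (c.true_negatives + c.false_positives)


def meets_recommendation(c: ScreeningCounts, threshold: float = 0.80) -> bool:
    """True if both indices reach the threshold (80% by default)."""
    return sensitivity(c) >= threshold and specificity(c) >= threshold


# Invented example: 200 children screened, 20 of whom are truly delayed.
counts = ScreeningCounts(true_positives=18, false_negatives=2,
                         true_negatives=144, false_positives=36)

print(f"sensitivity = {sensitivity(counts):.2f}")  # 0.90
print(f"specificity = {specificity(counts):.2f}")  # 0.80
print("meets 80% recommendation:", meets_recommendation(counts))  # True
```

In this invented example the screener misses only 2 of 20 delayed children but raises 36 false
alarms among 180 developmentally normal children, so it just reaches the recommended level on
specificity while comfortably exceeding it on sensitivity.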

The Denver Developmental Screening Test — II. The Denver Developmental Screening Test — II 
(Frankenburg et al., 1990) is one of the most popular developmental screening tests (with an age range 
of 1 month to 6 years). The main appeal of the Denver is its brevity: it takes 15-20 minutes to
administer. The content of the test includes personal-social, motor, language, and adaptive domains.

The scoring is based on parent reports, direct child assessment, and observation. The assessment result 
is expressed by a single score assigning the child to one of the four descriptive categories: Pass, Ques- 
tionable, Abnormal, or Untestable. 

The standardization of the Denver — II was done exclusively using young children from Colorado; 
therefore, the norms should be used cautiously. 

The original version of the Denver (Frankenburg, Dodds, & Fandal, 1975) was found to be insufficient- 
ly sensitive to identify correctly most children with developmental delays or disabilities (Greer, Bauch- 
ner & Zuckerman, 1989). The sensitivity of the Denver — II has been greatly improved, but there is now 
evidence that it significantly over-identifies children as developmentally delayed or disabled (Glascoe
& Byrne, 1993; Johnson, Ashford, Byrne, & Glascoe, 1992). The reported test-retest reliability is .90 
and the reported inter-rater reliability is .98. 

The Denver-II provides forms in Spanish, but no norms for children of diverse cultural backgrounds are 
available. 

Other tests. The Early Screening Profile (ESP; Harrison, 1990) and the Developmental Indicators for
the Assessment of Learning (DIAL-R; Mardell-Czudnowski & Goldberg, 1990) are examples of
developmental screening instruments that rely primarily on direct assessment of the child. Both
instruments are applicable to children 2 to 6 years old, both take about 30 minutes to administer,
both assess motor, language, and cognitive functioning (the ESP also evaluates a child’s
personal-social and adaptive behaviors), both are normed on large, representative samples of
children, and both exemplify some of the soundest validation available in the developmental-screening
field. Test-retest reliability coefficients were .87 for the DIAL-R and .78-.89 for the ESP. Both
instruments are available in English only.

In contrast to the ESP and DIAL-R, the Developmental Profile-II (DP-II; Alpern, Boll, & Shearer, 1986)
and the Developmental Observation Checklist System (DOCS; Hresko, Miguel, & Sherbenou, 1994)
solely utilize data obtained from caregivers’ reports. Both tools demonstrate adequate psychometric 
properties, evaluate children in a number of domains (DP-II: muscle and motor abilities, self-help, 
social, cognitive/intellectual, expressive and receptive communication; DOCS: language, motor, social, 
and cognitive development, child’s adjustment behavior, and levels of family stress and support), and 
were standardized on large, nationally representative samples of children. The age range of the DOCS
is birth to 6 years and that of the DP-II is birth to 9 1/2 years.

The BRIGANCE Preschool Screen (Brigance, 1985) assesses children between 3 and 4 years of age. 
The administration takes 10-15 minutes. The reported internal-consistency reliability coefficient is .82 
and the reported test-retest reliability indicator is .97. The device evaluates children’s functioning in
motor, language, body parts, colors, and personal domains. The standardization sample is small (408 
children only). The test directions are available in English and Spanish. 

The Miller Assessment for Preschoolers (MAP, Miller, 1988) is designed to screen children between 2.9 
and 5.8 years of age. The administration of the MAP takes 25-30 minutes. The reported test-retest relia- 
bility ranges between .81 and .98. The MAP assesses children in three domains: motor, language, and 
cognition. The MAP is available in English only. 

IV. Concluding Remarks

In closing this discussion, we would like to make three remarks. 

First, there is no single instrument whose properties in all domains of comparison (qualitative and 
quantitative) stand out uniformly. For example, some instruments are psychometrically more sound, 
whereas others appear to be more developmentally appropriate, or include more engaging and more 
appealing materials. Some instruments are more theoretically grounded whereas others are primarily 
empirically driven. Some instruments are better in differentiating the upper tail of the cognitive ability 
distribution, whereas others better differentiate the lower tail. 

Second, there is no unified rule regarding which test should be used when. Multiple issues should be
considered when selecting a test. These issues include (but are not limited to) the purpose of
testing, the conditions of testing, the tester’s expertise, the availability of materials, and the
cost.

Finally, when a test is considered for use in a culture different from that where it was developed and
standardized, its “implantation” in the new culture should be carried out very cautiously. Many (if
not all) tests are culturally biased. They tend to favor the performance of children raised in the culture 
in which the test was created and suppress the performance of children from different cultures. 